%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Configuring \SIM}
\label{sec:knob}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


To control simulation and architectural parameters, knob variables defined 
in \textit{def/*.param.def} files are used. The build process automatically 
converts the knob definitions in these files into c++ source that gets included in 
the compilation of the \SIM binary. Using different parameter values for the knob 
variables \SIM can be configured to simulate different CPU, GPU and even 
heterogeneous architectures. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Adding a new knob}
\label{sec:knob1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

A new knob variable can be defined by adding a line in the following format in 
one of the \textit{param.def} files:

\begin{Verbatim}
param<{name used in MacSim}, {name used in the command line or params file}, {knob type}, {default value}>
\end{Verbatim}

It is recommended that the first two arguments be the same, except that the
former be in upper case and the latter in lower case. For example,

\begin{Verbatim}
param<L2_ASSOC, l2_assoc, int, 8>
\end{Verbatim}


\ignore
		{ However, one restriction is this new variable must be defined in
		\textit{trunk/def} directory and the file name should be \textit{*.param.def}
		to parse variables correctly.  }

Knobs can be pretty much of any type, to support knobs of type other than the
basic data types users may have to add code to \textit{src/knob.h}. Knobs of
type string are already supported, this is helpful for specifying policies and
configurations such as branch predictor type, instruction scheduler type, dram
scheduling policy and so on as strings instead of integers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Accessing the value of a knob in \SIM}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In the \SIM code, a knob variable is accessed by prefixing its name (in
uppercase) with \textit{KNOB\_}.  For example, L2\_ASSOC defined in the example
in Section~\ref{sec:knob1} can be accessed in \SIM using the name
KNOB\_L2\_ASSOC.  To access the value of a knob either the \textit{getValue()}
function or the name of the knob (works because of operator overloading) can be
used. (e.g. \textit{*KNOB(KNOB\_L2\_ASSOC)})

\ignore
		{ \subsection{Adding a New String Type Knob} Several knobs especially
		setting policies use {\texttt string} type.  Examples are branch predictor,
		instruction scheduler, dram scheduler, and llc setting. Please see {\texttt
		factory\_class.cc/h} and {\texttt bp.cc}.  }

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Assigning values to knobs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

When defining a knob in one of the \textit{*.param.def} files, a default value
for the knob has to be specified. If a user does not set the value of a knob
for a simulation, then the knob assumes its default value. If a user wishes to
change the value of a knob for a simulation, there are two ways in which the
user can accomplish this:

\ignore 
		{ \section{How to apply different value to a knob variables} There are
		two ways of modifying the default value of a knob variable.  }

\begin{enumerate}
\item edit params.in - this file is read by the \SIM binary on startup for
parameter values.  Sample parameter files with parameter values for different
configurations can be found in \textit{params/} directory. Each line in
a parameter file consists of a knob name (in lowercase) followed by the
parameter value. For example:
\begin{Verbatim}
l1_assoc 8
l2_assoc 16
\end{Verbatim}


\item specify parameter values from the command line - an user can specify
knobs and their values from the command line as shown below.

\begin{Verbatim}
./macsim --l1_assoc=8 --l2_assoc=16
\end{Verbatim}
\end{enumerate}

For a knob with values specified in the params.in file as well as the command
line, the value specified in the command line takes priority over the value
specified in the params.in file.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setting up Parameters}
\label{sec:parameter}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This section shows how different kinds of simulations and simulation
configurations can be achieved using knobs and parameter values. \SIM can model up
to three types of cores, SMALL, MEDIUM and LARGE, in a simulation. Equivalent
knobs are provided for each core type for configuration purposes. Knobs for
medium and large cores use \Verb+medium+ and \Verb+large+ in their names, while
knobs for small cores do not use any such identifiers in their names. For
example, the knob rob\_size sets the length of the ROB for a small core. The
equivalent knobs for medium and large cores are rob\_medium\_size and
rob\_large\_size.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Repeating Traces}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For multiple-application simulations, sometimes, early-terminating applications
have to be re-executed to model resource contention (cache, on-chip
interconnection, and memory controller) until all applications finish. This
is a common methodology adopted in evaluating multi-program workloads and is
supported by \SIM too. To enable this feature, the repeate\_trace knob must be
turned on i.e. set repeat\_trace to 1. This can be done either via params.in
file or from the command line. 

\ignore
		{
		\begin{itemize}
		  \item Specify the configuration in the command line by
		  \begin{Verbatim}
		  ./macsim --repeat_trace=1
		  \end{Verbatim}

		  \item Or you can write this in \textit{params.in} file
		  \begin{Verbatim}
		  repeat_trace 1
		  \end{Verbatim}
		\end{itemize}
		}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\cpu Experiments}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{One \cpu Core}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Usually, cores of type large are configured as \cpu cores, however, this is not
mandatory. 

\begin{Verbatim}
// params/params_x86
num_sim_cores 1
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 1
large_core_type x86
\end{Verbatim}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Multiple \cpu cores}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For multi-core simulations users have to specify the number of cores as greater than one.


\begin{Verbatim}
// 4-core simulation
num_sim_cores 4
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 4
large_core_type x86
repeat_trace 1
\end{Verbatim}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{2-way SMT x86 Core}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\SIM also supports the simultaneous multi-threading (SMT) features.
These parameters are used for the SMT configurations:

\Verb+ max_threads_per_core+, \Verb+max_threads_per_medium_core+, and 
\Verb+max_thread_per_large_core+. 

\noindent For example:
\begin{Verbatim}
// 1-core 2-way SMT configuration
num_sim_cores 1
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 1
large_core_type x86
max_threads_per_large_core 2
repeat_trace 1
\end{Verbatim}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{\cpu Core Parameters}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

By default these parameter values are applied to large cores. 

\begin{tabular}{l l}
 \Verb+large_width 2+ & \Verb+// pipeline width (the entire pipelines use the same width)+ \\
 \Verb+large_core_fetch_latency 5+ & \Verb+// front-end depth+ \\
 \Verb+large_core_alloc_latency 5+ & \Verb+// decode/allocation depth+ \\
 \Verb+bp_dir_mech gshare+ & \Verb+// this is common to all core types+ \\
 \Verb+bp_hist_length 14+ & \Verb+// branch history length+ \\
 \Verb+isched_large_rate 4 + & \Verb+// # of integer instructions that can be executed per cycle+ \\
 \Verb+msched_large_rate 2+ & \Verb+// # of memory instructions that can be executed per cycle + \\
 \Verb+fsched_large_rate 2 + & \Verb+// # of FP instructions that can be executed per cycle+ \\
 \Verb+large_core_schedule  io + & \Verb+// in order instruction scheduling, set to "ooo" for out of order scheduling + \\
 \Verb+rob_large_size 96+ & \Verb+// ROB size+ \\
 \Verb+fetch_policy rr+ & \Verb+// SMT(MT) thread fetch policy  by default: round-robin, common to all core types+
\end{tabular}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\gpu Simulations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Usually cores of type small are configured as \gpu cores. Several pre-defined
parameter files for simulating NVIDIA architectures are provided, below is the
list of GPU parameter files provided with \SIM.

\begin{center}
 \begin{tabular}{| l | l |}
  \hline
  File Name & Description \\ \hline \hline
  params\_8800gt &  NVIDIA GeForce 8800GT (G80 architecture) \\
  params\_gtx280 &  NVIDIA GeForce GTX280 (GT200 architecture) \\
  params\_gtx465, params\_gtx480 & NVIDIA GeForece GTX465, GTX480 (Fermi architecture) \\
  \hline
 \end{tabular}
\end{center}

\noindent Below are some sample configurations:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection*{\gpu with One Application}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{tabular}{l l}
 \Verb+// 12-SM simulations+ & \Verb++ \\
 \Verb+num_sim_small_cores 12+ & \Verb++ \\
 \Verb+core_type ptx+ & \Verb++ \\
 \Verb+max_threads_per_core 80+ & \Verb+// set the max number of warps per SM+
\end{tabular}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection*{\gpu with Multiple Applications}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{tabular}{l l}
 \Verb+// 12-SM simulations, 6 SMs for each application+ & \Verb++ \\
 \Verb+num_sim_cores 12+ & \Verb++ \\
 \Verb+num_sim_small_cores 12+ & \Verb++ \\
 \Verb+core_type ptx+ & \Verb++ \\
 \Verb+max_threads_per_core 80+ & \Verb+// set the max number of warps + \\
 \Verb+max_num_core_per_appl 6+ & \Verb+// 6 SMs for each application+ \\
 \Verb+repeat_trace 1+ & \Verb+// for multi-program workload simulation+
\end{tabular}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Number of Thread Blocks Per Core}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The Maximum number of thread blocks per core for a kernel is determined by
several factors: 
\begin{enumerate}
 \item number of threads in each thread block
 \item number of registers used by each thread
 \item amount of shared memory required
 \item GPU architecture (the CUDA compute version)\footnote{note that the CUDA 
 compute version is set by the user while generating traces}.
\end{enumerate}
\noindent The PTX trace generator calculates the maximum thread blocks per core based on the 
values provided by Ocelot for these variables includes it the trace output. The
calculation is similar to what is done by the CUDA occupancy calculator. Users
can override this value for all GPU applications by setting the
\Verb+max_block_per_core_super+ knob.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{\gpu Core Parameters}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{tabular}{l l}
 \Verb+schedule_ratio 4+ & \Verb+\\ schedule instructions on every 4 cycle+ \\
 \Verb+fetch_ratio  4+ & \Verb+\\ fetch new instructions on every 4 cycle+ \\
 \Verb+gpu_sched 1+ & \Verb+\\ use GPU scheduler for GPU cores+ \\ %\todo{this knob should be removed!, it is unnecessary now}
 \Verb+const_cache_size  1024+ & \Verb+\\ 1024B constant cache+ \\
 \Verb+texture_cache_size 1024+ & \Verb+\\ 1024B texture cache (currently, each core has a private texture cache)+ \\
 \Verb+shared_mem_size 4096+ & \Verb+\\ 4096B shared memory size + \\
 \Verb+ptx_exec_ratio  4+ & \Verb+\\ factor by which latency values defined in uoplatency_ptx.def+ \\
 \Verb++			     & \Verb+\\ must be multiplied for actual PTX instruction latency+ \\
 \Verb+num_warp_scheduler 2+ & \Verb+\\ the number of warps to schedule whenever instruction scheduler is run+
\end{tabular}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Heterogeneous Architecture Simulations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For CPU-GPU heterogeneous simulations, an architecture similar to Intel's Sandy
Bridge~\cite{sandybridge} is modeled. However, the \gpu cores in this model are
similar to NVIDIA's Fermi~\cite{fermi} architecture. Two sample parameter files 
for a heterogeneous configuration is also provided.

\begin{center}
 \begin{tabular}{| l | l |}
  \hline
  File Name & Description \\ \hline \hline
  params\_hetero\_1\_6 &  1-CPU, 6-GPU cores \\
  params\_hetero\_4c\_4g &  4-CPU, 4-GPU cores \\
  \hline
 \end{tabular}
\end{center}

\noindent Following example shows a simple heterogeneous configuration.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection*{One CPU application + One GPU application}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{Verbatim}
num_sim_cores 2
num_sim_small_cores 1
num_sim_medium_cores 0
num_sim_large_cores 1
core_type ptx
large_core_type x86
cpu_frequency 3
gpu_frequency 1.5
repeat_trace 1
\end{Verbatim}


Although the above configuration sets up the number of CPU and GPU cores
correctly, users still have to setup each core types individually. Please refer
to sample files for other parameter values.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection*{Multiple CPU applications + Multiple GPU applications}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{Verbatim}
num_sim_cores 8
num_sim_small_cores 4
num_sim_medium_cores 0
num_sim_large_cores 4
core_type ptx
large_core_type x86
cpu_frequency 3
gpu_frequency 1.5
repeat_trace 1
\end{Verbatim}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Cache Configuration}
\label{sec:knob:cache}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The cache can be configured using the following knobs:

\begin{tabular}{l l}
 \\
 \Verb+l{1,2}_{small, medium, large}_num_set+ & \Verb+// number of sets+ \\
 \Verb+l{1,2}_{small, medium, large}_assoc+ & \Verb+// associativity+ \\
 \Verb+l{1,2}_{small, medium, large}_line_size+ & \Verb+// cache line size+ \\
 \Verb+l{1,2}_{small, medium, large}_num_bank+ & \Verb+// number of banks+ \\
 \Verb+l{1,2}_{small, medium, large}_latency + & \Verb+// cache latency+ \\
 \Verb+l{1,2}_{small, medium, large}_bypass + & \Verb+// cache bypass (if set, always miss)+ \\
 \Verb+num_llc + & \Verb+// number of llc cache tiles+ \\
 \Verb+llc_num_set + & \Verb+// number of llc cache sets+ \\
 \Verb+llc_assoc + & \Verb+// llc associativity+ \\
 \Verb+llc_line_size+ & \Verb+// llc line size+ \\
 \Verb+llc_num_bank+ & \Verb+// llc number of banks+ \\
 \Verb+llc_latency + & \Verb+// llc latency+ \\
 \Verb+l{1,2,3}_{read,write}_port + & \Verb+// the number of read / write port+ \\
 \Verb+icache_num_set 8 + & \Verb+// 4KB I-cache number of set+ \\
 \Verb+icache_assoc   8 /+ & \Verb+/ I cache set associativity + \\
 \\
\end{tabular}

\noindent The effective cache size can be calculated using Equation~\ref{eq:cachesize}.
\begin{equation}
\label{eq:cachesize}
cache\_size = num\_set \times assoc \times line\_size \times num\_tiles (llc only, otherwise 1)
\end{equation}


\noindent For instance, the size of a cache with 256 sets, 16 ways per set, 64B per cache lines and 4
tiles is $256 \times 16 \times 64 \times 4 = 1\textit{MB}$

Cache latency is determined by several factors including the
size, technology, and the number of ports. Cacti~\cite{cacti} can be
used to model cache latency accurately. The cache line size is set to
64B by default. Although the cache line size can be any power of 2,
all cache levels must have the same cache line size.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{DRAM configuration}
\label{sec:param-dram}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For configuring the DRAM system including the memory controllers and the DRAM
itself, the knobs shown below are available. 

\begin{tabular}{l l}
 \\
 \Verb+dram_frequency 0.8+ & \Verb+// dram frequency+ \\
 \Verb+dram_bus_width 4 + & \Verb+// dram bus width+ \\
 \Verb+dram_column 11+ & \Verb+// column access (CL) latency+ \\
 \Verb+dram_activate 25 + & \Verb+// row activate (RCD) latency+ \\
 \Verb+dram_precharge 10+ & \Verb+// precharge (RP) latency+ \\
 \Verb+dram_num_mc 2 + & \Verb+// number of memory controllers+ \\
 \Verb+dram_num_banks 8+ & \Verb+// number of banks per controller+ \\
 \Verb+dram_num_channel 2 + & \Verb+// number of dram channels per controller+ \\
 \Verb+dram_rowbuffer_size 2048 + & \Verb+// row buffer size+ \\
 \Verb+dram_scheduling_policy FRFCFS+ & \Verb+// dram scheduling policy+ \\
 \\
\end{tabular}

\noindent \SIM models three DRAM timing parameters - precharge ($t_{RP}$), activate
($t_{RCD}$), and column access ($t_{CL}$). While DRAM bandwidth is modeled using the
parameters \Verb+dram_frequency+, \Verb+dram_bus_width+, and
\Verb+dram_num_channel+. The maximum DRAM bandwidth can be calculated using
Equation~\ref{eq:bandwidth}. 

\begin{equation}
\label{eq:bandwidth}
max\_bandwidth = dram\_frequency \times dram\_bus\_width \times dram\_num\_mc \times dram\_num\_channel 
\end{equation}

\noindent For example, the maximum bandwidth of a DRAM system with the above parameter
values is

\Verb+ 800 MHz (0.8 GHz) * 4 Bytes+
\Verb+* 2 MCs * 2 Channels = 12.8GB/s+.
 

\noindent Currently, \SIM provides two memory scheduling policies: FCFS 
(First-Come-First-Serve) and FR-FCFS (First-Ready First-Come-First-Serve).



\ignore{
		\subsubsection{Hardware Prefetching}
		We provide stride hardware prefetcher~\cite{iac:spr04}. 
		}



% LocalWords:  macsim num sim SMT multi pre NVIDIA params GeForce gtx GTX ptx
% LocalWords:  GeForece Multipe SMs appl GPU cpu gpu CUDA GPUOcelot RCD mc GHz
% LocalWords:  precharge rowbuffer FRFCFS MCs FCFS Prefetching prefetcher alloc
% LocalWords:  icache Microarchitecture bp dir mech gshare isched msched fsched
% LocalWords:  FP io ooo rr th sched const mem





